
[Hardware][Powerpc] Enable prefix caching and chunked prefill for ppc64le #35081

Merged
bigPYJ1151 merged 5 commits into vllm-project:main from Akashcodes732:feat/enable_pc_cp_ppc on Feb 26, 2026

Conversation

@Akashcodes732 (Contributor) commented Feb 23, 2026

Purpose

Removes the check for POWERPC in vllm/engine/arg_utils.py to enable chunked prefill and prefix caching on ppc64le.

Test Plan and Result

Ran server with prefix caching enabled

vllm bench serve \
    --backend openai \
    --model ibm-granite/granite-3.3-8b-instruct \
    --dataset-name prefix_repetition \
    --num-prompts 100 \
    --prefix-repetition-prefix-len 512 \
    --prefix-repetition-suffix-len 128 \
    --prefix-repetition-num-prefixes 5 \
    --prefix-repetition-output-len 128
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  186.79    
Total input tokens:                      64000     
Total generated tokens:                  11202     
Request throughput (req/s):              0.54      
Output token throughput (tok/s):         59.97     
Peak output token throughput (tok/s):    196.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          402.61    
---------------Time to First Token----------------
Mean TTFT (ms):                          53624.07  
Median TTFT (ms):                        61559.23  
P99 TTFT (ms):                           77142.36  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1205.40   
Median TPOT (ms):                        1087.50   
P99 TPOT (ms):                           3717.79   
---------------Inter-token Latency----------------
Mean ITL (ms):                           1070.85   
Median ITL (ms):                         869.89    
P99 ITL (ms):                            15577.77  

Ran server with prefix caching disabled

vllm serve ibm-granite/granite-3.3-8b-instruct --max-model-len 4096 --max_num_batched_tokens 4096 --no-enable-prefix-caching
============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Benchmark duration (s):                  425.90    
Total input tokens:                      64000     
Total generated tokens:                  11483     
Request throughput (req/s):              0.23      
Output token throughput (tok/s):         26.96     
Peak output token throughput (tok/s):    190.00   
Peak concurrent requests:                100.00    
Total token throughput (tok/s):          177.23    
---------------Time to First Token----------------
Mean TTFT (ms):                          166706.25 
Median TTFT (ms):                        160896.68 
P99 TTFT (ms):                           313166.60 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3161.35   
Median TPOT (ms):                        2086.99   
P99 TPOT (ms):                           19825.83  
---------------Inter-token Latency----------------
Mean ITL (ms):                           2122.36   
Median ITL (ms):                         916.20    
P99 ITL (ms):                            19883.92  
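Comparing the two runs above (same dataset, 100 requests each), enabling prefix caching roughly halves the end-to-end benchmark duration and cuts mean TTFT by about two thirds. A quick check using the numbers pasted above:

```python
# Compare the two serving benchmark runs pasted above (100 requests each).
with_pc = {"duration_s": 186.79, "mean_ttft_ms": 53624.07, "total_tok_s": 402.61}
without_pc = {"duration_s": 425.90, "mean_ttft_ms": 166706.25, "total_tok_s": 177.23}

# End-to-end speedup from enabling prefix caching + chunked prefill.
speedup = without_pc["duration_s"] / with_pc["duration_s"]
# Relative reduction in mean time-to-first-token.
ttft_reduction = 1 - with_pc["mean_ttft_ms"] / without_pc["mean_ttft_ms"]

print(f"End-to-end speedup: {speedup:.2f}x")         # 2.28x
print(f"Mean TTFT reduction: {ttft_reduction:.0%}")  # 68%
```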

Fixed Prompt with Prefix Caching

python benchmarks/benchmark_prefix_caching.py \
 --model ibm-granite/granite-3.3-8b-instruct \
 --enable-prefix-caching \
 --num-prompts 1 \
 --repeat-count 100 \
 --input-length-range 128:256 
Testing filtered requests
------start generating------
Rendering prompts: 100% 100/100 [00:00<00:00, 544.56it/s]
Processed prompts: 100% 100/100 [00:09<00:00, 10.69it/s, est. speed input: 2757.76 toks/s, output: 106.89 toks/s]
cost time 9.541510581970215

ShareGPT Dataset with Prefix Caching

python benchmarks/benchmark_prefix_caching.py \
  --model ibm-granite/granite-3.3-8b-instruct \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --enable-prefix-caching \
  --num-prompts 20 \
  --repeat-count 5 \
  --input-length-range 128:256
Testing filtered requests
------start generating------
Rendering prompts: 100% 100/100 [00:00<00:00, 1102.57it/s]
INFO 02-22 08:39:06 [loggers.py:259] Engine 000: Avg prompt throughput: 201.2 tokens/s, Avg generation throughput: 1.9 tokens/s, Running: 39 reqs, Waiting: 61 reqs, GPU KV cache usage: 1.8%, Prefix cache hit rate: 37.9%
INFO 02-22 08:39:24 [loggers.py:259] Engine 000: Avg prompt throughput: 216.8 tokens/s, Avg generation throughput: 5.5 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.8%, Prefix cache hit rate: 55.8%
Processed prompts: 100% 100/100 [00:45<00:00, 2.20it/s, est. speed input: 404.27 toks/s, output: 22.02 toks/s]
cost time 45.508230686187744
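The cache hit rates logged above come from block-level prefix reuse: KV-cache blocks are keyed by a hash of the token prefix, so prompts that share a prefix skip recomputation of the shared blocks. A toy sketch of the idea follows; this is not vLLM's implementation, and the block size and hashing scheme are illustrative.

```python
# Toy sketch of block-level prefix caching (NOT vLLM's implementation).
# KV blocks are keyed by a rolling hash of the token prefix, so repeated
# prefixes hit the cache instead of being recomputed.
from hashlib import sha256

BLOCK = 16  # tokens per block (illustrative)

def block_keys(tokens: list[int]) -> list[str]:
    """Key each full block by a hash of the entire prefix up to that block."""
    keys, running = [], b""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        running += str(tokens[i:i + BLOCK]).encode()
        keys.append(sha256(running).hexdigest())
    return keys

cache: set[str] = set()

def prefill(tokens: list[int]) -> int:
    """Return the number of blocks actually computed (cache misses)."""
    misses = 0
    for key in block_keys(tokens):
        if key not in cache:
            cache.add(key)
            misses += 1
    return misses

shared = list(range(64))               # 4 shared prefix blocks
first = prefill(shared + [100] * 16)   # all 5 blocks are cache misses
second = prefill(shared + [200] * 16)  # only the last block is a miss
print(first, second)  # 5 1
```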


Signed-off-by: Akash kaothalkar <akash.kaothalkar@ibm.com>
@gemini-code-assist (bot) left a comment


Code Review

This pull request enables chunked prefill and prefix caching for the PowerPC (ppc64le) architecture by removing the explicit architecture check in the engine configuration. The provided benchmark results demonstrate successful operation and performance improvements on this hardware. However, the refactoring is incomplete as the log messages within the associated code block still incorrectly list 'POWER' as an unsupported architecture, which will lead to misleading information for users on other platforms like s390x or RISC-V.

Comment thread vllm/engine/arg_utils.py Outdated
Signed-off-by: Akash kaothalkar <akash.kaothalkar@ibm.com>
@Akashcodes732 (Contributor, Author)

Hi @bigPYJ1151,

Could you please take a look at this PR?

@bigPYJ1151 bigPYJ1151 self-assigned this Feb 24, 2026
@Akashcodes732 (Contributor, Author)

Hi @bigPYJ1151,

Could you please take a look at the changes?

@mergify (bot) commented Feb 24, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Akashcodes732.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 24, 2026
@bigPYJ1151 (Member)

Hi @Akashcodes732, there are some conflicts that need to be resolved :)

Signed-off-by: Akash kaothalkar <akash.kaothalkar@ibm.com>
@mergify mergify bot removed the needs-rebase label Feb 25, 2026
@Akashcodes732 (Contributor, Author)

Hi @bigPYJ1151,

I have fixed the merge conflicts.

@bigPYJ1151 bigPYJ1151 enabled auto-merge (squash) February 25, 2026 08:16
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 25, 2026
@Akashcodes732 (Contributor, Author)

Hi @bigPYJ1151,

The failures look unrelated to this fix; could you please advise?

@bigPYJ1151 bigPYJ1151 disabled auto-merge February 25, 2026 15:37
@bigPYJ1151 bigPYJ1151 enabled auto-merge (squash) February 25, 2026 15:38
@Akashcodes732 (Contributor, Author)

Hi @bigPYJ1151,

I think you need to approve again :)

@bigPYJ1151 bigPYJ1151 merged commit e03ddcf into vllm-project:main Feb 26, 2026
52 checks passed
Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026
…4le (vllm-project#35081)

Signed-off-by: Akash kaothalkar <akash.kaothalkar@ibm.com>
Co-authored-by: Akash kaothalkar <akash.kaothalkar@ibm.com>

Labels

ready ONLY add when PR is ready to merge/full CI is needed


2 participants